Efficient Snapshot Retrieval over Historical Graph Data
We address the problem of managing historical data for large evolving
information networks like social networks or citation networks, with the goal
of enabling temporal and evolutionary queries and analysis. We present the design
and architecture of a distributed graph database system that stores the entire
history of a network and provides support for efficient retrieval of multiple
graphs from arbitrary time points in the past, in addition to maintaining the
current state for ongoing updates. Our system exposes a general programmatic
API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a
novel, extensible, highly tunable, and distributed hierarchical index structure
that enables compact recording of historical information and supports
efficient retrieval of historical graph snapshots for single-site or parallel
processing. Along with the original graph data, DeltaGraph can also maintain
and index auxiliary information; this functionality can be used to extend the
structure to efficiently execute queries like subgraph pattern matching over
historical data. We develop analytical models for both the storage space needed
and the snapshot retrieval times to aid in choosing the right parameters for a
specific scenario. In addition, we present strategies for materializing
portions of the historical graph state in memory to further speed up the
retrieval process. We also present an in-memory graph data structure
called GraphPool that can maintain hundreds of historical graph instances in
main memory in a non-redundant manner. Finally, we present a comprehensive
experimental evaluation that illustrates the effectiveness of our proposed
techniques at managing historical graph information.
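The core idea behind delta-based snapshot retrieval can be illustrated with a minimal sketch. This is not the paper's actual API or DeltaGraph's hierarchical index; all class and method names here are hypothetical, and the example shows only the basic principle of reconstructing a past snapshot by replaying timestamped deltas from the nearest materialized checkpoint.

```python
class DeltaStore:
    """Toy delta-based store: materialized checkpoints plus timestamped edge deltas."""

    def __init__(self):
        # time -> fully materialized edge set (here only the empty graph at time 0)
        self.checkpoints = {0: set()}
        # list of (time, op, edge), where op is "add" or "del"
        self.deltas = []

    def record(self, time, op, edge):
        """Append an edge addition or deletion at the given timestamp."""
        self.deltas.append((time, op, edge))

    def snapshot(self, t):
        """Reconstruct the edge set as of time t."""
        # start from the latest checkpoint at or before t
        base_time = max(ct for ct in self.checkpoints if ct <= t)
        edges = set(self.checkpoints[base_time])
        # replay all deltas in the interval (base_time, t]
        for time, op, edge in self.deltas:
            if base_time < time <= t:
                if op == "add":
                    edges.add(edge)
                else:
                    edges.discard(edge)
        return edges

store = DeltaStore()
store.record(1, "add", ("a", "b"))
store.record(2, "add", ("b", "c"))
store.record(3, "del", ("a", "b"))
print(sorted(store.snapshot(2)))  # [('a', 'b'), ('b', 'c')]
print(sorted(store.snapshot(3)))  # [('b', 'c')]
```

A real system along these lines would keep many checkpoints arranged hierarchically and choose the one minimizing replay cost; the paper's analytical models address exactly that trade-off between storage space and retrieval time.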
HISTORICAL GRAPH DATA MANAGEMENT
Over the last decade, we have witnessed an increasing interest in temporal analysis of information networks such as social networks or citation networks. Finding temporal interaction patterns, visualizing the evolution of graph properties, or even simply comparing them across time, has proven to add significant value in reasoning over networks. However, because of the lack of underlying data management support, much of the work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. Unfortunately, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. In the absence of appropriate support, an analyst today has to manually navigate the added temporal complexity of large evolving graphs, making the process cumbersome and ineffective.
In this dissertation, I address the key challenges in storing, retrieving, and analyzing large historical graphs. In the first part, I present DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compact recording of historical information and supports efficient retrieval of historical graph snapshots. I present analytical models for estimating required storage space and snapshot retrieval times, which aid in choosing the right parameters for a specific scenario. I also present optimizations such as partial materialization and columnar storage to speed up snapshot retrieval. In the second part, I present the Temporal Graph Index, which builds upon DeltaGraph to support version-centric retrieval, such as a node’s 1-hop neighborhood history, along with snapshot reconstruction. It provides high scalability, employing careful partitioning, distribution, and replication strategies that effectively deal with the temporal and topological skew typical of temporal graph datasets. In the last part of the dissertation, I present a Temporal Graph Analysis Framework that enables analysts to express a variety of complex historical graph analysis tasks using a set of novel temporal graph operators and to execute them efficiently and scalably in the cloud. My proposed solutions are engineered in the form of a framework called the Historical Graph Store, designed to facilitate a wide variety of large-scale historical graph analyses.
Trust in AutoML: Exploring Information Needs for Establishing Trust in Automated Machine Learning Systems
We explore trust in a relatively new area of data science: Automated Machine
Learning (AutoML). In AutoML, AI methods are used to generate and optimize
machine learning models by automatically engineering features, selecting
models, and optimizing hyperparameters. In this paper, we seek to understand
what kinds of information influence data scientists' trust in the models
produced by AutoML. We operationalize trust as a willingness to deploy a model
produced using automated methods. We report results from three studies --
qualitative interviews, a controlled experiment, and a card-sorting task -- to
understand the information needs of data scientists for establishing trust in
AutoML systems. We find that including transparency features in an AutoML tool
increased users' trust in the tool and their understanding of it, and that, of
all the proposed features, model performance metrics and visualizations are the
most important information for data scientists when establishing trust in an
AutoML tool.
Comment: IUI 202
Sense-making strategies in explorative intelligence analysis of network evolutions
Visualising how social networks evolve is important in intelligence analysis in order to detect and monitor issues such as emerging crime patterns or rapidly growing groups of offenders. How this type of information should be presented for visual exploration remains an open research question. To get a sense of how users work with different types of visualisations, we evaluate a matrix and a node-link diagram in a controlled thinking-aloud study. We describe the sense-making strategies that users adopted during explorative and realistic tasks. In doing so, we focus on user behaviour when switching between the two visualisations and propose a set of nine strategies. Based on a qualitative and quantitative content analysis, we show which visualisation supports which strategy better. We find that the two visualisations clearly support intelligence tasks and that, for some tasks, their combined use is more advantageous than the use of either visualisation alone.